Adapting methods described in the poster uploaded on Google Drive, I first calculated the mean pre-pubertal fE2 concentration. Because infants/juvs have elevated fE2 when nursing, we need to exclude these individuals when determining what typical pre-pubertal fE2 looks like.
Per Slack communication on 7/17: defining prebuteral E from
individuals greater than 2.5 years (def not nursing) and under 3 (def
not pubertal)
Per Zoom meeting on 7/22: defining prepubertal E from individuals between 3 and 4 years old.
## calculating average pre-pubertal E2
prepubertal_data<-e_data |>
filter(age_at_sample >= 3.0 & age_at_sample <4) |> # we want age 3..def pre menstruation
select(ind_id, age_at_sample, e_conc_ug)
Averaging the e concentrations for all samples from individuals between 3 and 4 years of age.
prepubertal_e_data <- prepubertal_data |>
summarise(mean_e_conc_ug = mean(e_conc_ug, na.rm = TRUE),
sd_e_conc_ug = sd(e_conc_ug, na.rm = TRUE))
mean_prepubertal_e_conc_ug <- prepubertal_e_data$mean_e_conc_ug
sd_prepubertal_e_conc_ug <- prepubertal_e_data$sd_e_conc_ug
Using the mean + 2SD approach:
calculated_treshold <- mean_prepubertal_e_conc_ug + 2*sd_prepubertal_e_conc_ug
print(paste("Calculated fE2 threshold for menarche: ", calculated_treshold, " ug/g"))
## [1] "Calculated fE2 threshold for menarche: 2.84527656823452 ug/g"
Looking at the average age at which the population has an E reading that crosses the threshold E. Restricting dataset to individuals who we started sampling pre menstruation (before age 3.75). Only considering individuals over 4 years old to be potentially mature (i.e., only considering a threshold crossing at greater than 4 years to be indicative of menarche).
# Identify the first age where e_conc_ug rises above the threshold for each individual
first_above_threshold <- e_data %>%
group_by(ind_id) %>%
filter(min(age_at_sample) <= 3.75) |> # want to make sure we have samples starting before menstruation to assign cutoff age
ungroup() |>
filter(age_at_sample>4) |> # only considering potentially pubescent if older than 4
filter(e_conc_ug > calculated_treshold) %>%
group_by(ind_id) %>%
summarise(first_age_above_threshold = min(age_at_sample, na.rm = TRUE)) %>%
ungroup()
# Calculate the mean and standard error of the ages
mean_age_above_threshold <- mean(first_above_threshold$first_age_above_threshold, na.rm = TRUE)
se_age_above_threshold <- sd(first_above_threshold$first_age_above_threshold, na.rm = TRUE) / sqrt(nrow(first_above_threshold))
print(paste("Average age at which e_conc_ug rises above the threshold: ", mean_age_above_threshold))
## [1] "Average age at which e_conc_ug rises above the threshold: 4.60833333333333"
print(paste("Standard error of the average age: ", se_age_above_threshold))
## [1] "Standard error of the average age: 0.183601319288409"
This plot might be a bit busy, but I think it is a helpful visualization. I am very open to feedback! Here is how to interpret it:
The x-axis is 1-year age-bins
The y-axis is fE2 concentrations (untransformed, in ug/g)
The transparent green
The dashed light blue horizontal line is the calculated prepubertal fE threshold (y = 2.845277 ug/g)
The dashed red vertical line is the calculated average age at puberty (x = 4.641429 years). The shaded region around this is the 95% CI estimate (±1.96*SE)
The transparent green points are all our sample readings for samples from individuals <8 years old at time of sample collection. Unlike the boxplots, which are grouped on the x-axis by 1-year age groups, these green points are positioned on the x-axis at their exact age at sample.
The blue line represents the results of a Loess regression, a non-parametric approach to modeling the relationship between x (age bins) and y (fE2). The gray shading around the blue line is the error of the estimate.
pop_e_data_menarche <- e_data %>%
filter(age_at_sample<8) |> #only wanting to visualize under age 8
mutate(age_group = cut(age_at_sample, breaks = seq(0, max(age_at_sample, na.rm = TRUE) + 1, by = 1), right = FALSE))
## get sample size ----
n_figure_1 <- nrow(pop_e_data_menarche)
##Figure 1
par(mfrow = c(1,1))
# Calculating mean and standard error for each age group
age_summary <- pop_e_data_menarche %>%
group_by(age_group) %>%
summarise(
mean_e_at_age_years = mean(e_conc_ug, na.rm = TRUE),
se_e_at_age_years = sd(e_conc_ug, na.rm = TRUE) / sqrt(n())
) %>%
ungroup()
print("Showing all datapoints with box plots for age-year groupings and smoothing")
## [1] "Showing all datapoints with box plots for age-year groupings and smoothing"
ggplot(data = pop_e_data_menarche, aes(x = as.factor(floor(age_at_sample+1)), y = e_conc_ug)) +
geom_point(data = pop_e_data_menarche, aes(x = age_at_sample, y = e_conc_ug),
alpha = 0.2, color = "green") +
geom_boxplot() +
geom_smooth(data = age_summary, aes(x = as.numeric(age_group), y = mean_e_at_age_years), method = "loess", color = "blue") +
geom_vline(xintercept = mean_age_above_threshold, linetype = "dashed", color = "red") +
geom_hline(yintercept = calculated_treshold, linetype = "dashed", color = "skyblue") +
geom_rect(aes(xmin = mean_age_above_threshold - 1.96 * se_age_above_threshold, xmax = mean_age_above_threshold + 1.96 * se_age_above_threshold,
ymin = -Inf, ymax = Inf), alpha = 0.01, fill = "pink") +
ggtitle(paste0("Estradiol levels by age at sample (N = ", n_figure_1, " samples)")) +
xlab("Age at sample (years)") +
ylab("Fecal estradiol (ug/g)") +
scale_x_discrete(labels = c("0-1", "1-2", "2-3", "3-4", "4-5", "5-6", "6-7", "7-8", "8-9", "9-10"))
## `geom_smooth()` using formula = 'y ~ x'
I am pulling data on females that we have good E coverage for before and after potential menarche…I’m looking for IDs with samples starting at least by age 3.75. At least 25 samples per ID. That leaves us with the following IDs:
*NOTE: excluding TAB because we don’t have any good pre-pubertal samples from her
females_for_menarche_panel<-e_data |>
group_by(ind_id) |>
filter(min(age_at_sample)<3.75) |>
mutate(count = n()) |>
ungroup() |>
select(ind_id,count) |>
arrange(desc(count)) |>
distinct() |>
filter(count>25) |>
filter(ind_id != "TAB") |>
pull(ind_id)
menarche_panel_df<- e_data |>
filter(ind_id %in% females_for_menarche_panel) |>
filter(age_at_sample < 8) |>
select(ind_id, age_at_sample, sample_id, date, month, time, hour, e_conc_ug, is_pregnant) |>
arrange(ind_id, age_at_sample) |>
mutate(date = as.Date(date))
print(females_for_menarche_panel)
## [1] "TER" "MIL" "TAN" "MAY" "MON" "MIS" "MUS"
I am going to calculate the age at menarche for each individual using the same approach as for the population level (mean + 2sd for each individual, only considering ages >4 for potential menarche). Age at menarche for each individual is represented with the dashed red vertical line.
# Calculate average and sd estradiol for each prepubertal ID
prepubertal_e_data_by_individual <- e_data |>
filter(age_at_sample >= 3.0 & age_at_sample < 4) |> # age 3, pre-menarche
select(ind_id, age_at_sample, e_conc_ug) |>
filter(ind_id %in% females_for_menarche_panel) |>
group_by(ind_id) |>
summarise(mean_e_conc_ug = mean(e_conc_ug, na.rm = TRUE),
sd_e_conc_ug = sd(e_conc_ug, na.rm = TRUE))
# Assign threshold by ID
prepubertal_e_data_by_individual <- prepubertal_e_data_by_individual |>
mutate(threshold_by_id = mean_e_conc_ug + 2 * sd_e_conc_ug)
# Loop over individuals to plot estradiol levels
suppressWarnings(
for (individual in females_for_menarche_panel) {
individual_data <- menarche_panel_df |>
filter(ind_id == individual)
threshold_by_individual <- prepubertal_e_data_by_individual |>
filter(ind_id == individual) |>
pull(threshold_by_id)
print(paste("Threshold for", individual, ":", threshold_by_individual, "ug/g"))
# Get the age at which estradiol first exceeds the threshold
first_above_threshold_age_by_id <- individual_data |>
filter(age_at_sample > 4, e_conc_ug > threshold_by_individual) |>
summarise(first_age_above_threshold = min(age_at_sample, na.rm = TRUE)) |>
pull(first_age_above_threshold)
print(paste0("Age at menarche for ", individual, ": ", first_above_threshold_age_by_id, "years"))
# Get duckface dates
duckface_ages <- duckface_data %>%
filter(ind_id == individual) %>%
pull(age_at_sample)
# Create the plot
p <- ggplot(data = individual_data, aes(x = age_at_sample, y = e_conc_ug)) +
geom_point() +
geom_line(color = "blue") +
geom_vline(xintercept = first_above_threshold_age_by_id,
linetype = "dashed", color = "red") +
ggtitle(paste0("Estradiol by age at sample for individual ", individual, " (n = ", nrow(individual_data), " samples)")) +
xlab("Age at sample") +
ylab("Fecal estradiol (ug/g)") +
theme(legend.position = "none")
print(p)
}
)
## [1] "Threshold for TER : 1.10006119055108 ug/g"
## [1] "Age at menarche for TER: 4.7years"
## [1] "Threshold for MIL : 0.196851661723125 ug/g"
## [1] "Age at menarche for MIL: 4.42years"
## [1] "Threshold for TAN : 0.47255972819919 ug/g"
## [1] "Age at menarche for TAN: 4.07years"
## [1] "Threshold for MAY : 1.98306642185859 ug/g"
## [1] "Age at menarche for MAY: 4.04years"
## [1] "Threshold for MON : 7.42743013084399 ug/g"
## [1] "Age at menarche for MON: Infyears"
## [1] "Threshold for MIS : 3.03075208724884 ug/g"
## [1] "Age at menarche for MIS: Infyears"
## [1] "Threshold for MUS : 0.699842584095534 ug/g"
## [1] "Age at menarche for MUS: 4.54years"
The purple vertical lines represent recordings of duckface. Visually, it doesn’t seem like duckface observations are associated with menarche.
suppressWarnings(
for (individual in females_for_menarche_panel) {
individual_data <- menarche_panel_df |>
filter(ind_id == individual)
threshold_by_individual <- prepubertal_e_data_by_individual |>
filter(ind_id == individual) |>
pull(threshold_by_id)
print(paste("Threshold for", individual, ":", threshold_by_individual))
# Get the age at which estradiol first exceeds the threshold
first_above_threshold_age_by_id <- individual_data |>
filter(age_at_sample > 4, e_conc_ug > threshold_by_individual) |>
summarise(first_age_above_threshold = min(age_at_sample, na.rm = TRUE)) |>
pull(first_age_above_threshold)
print(paste0("Age at menarche for ", individual, ": ", first_above_threshold_age_by_id))
# Get duckface dates
duckface_ages <- duckface_data %>%
filter(ind_id == individual) %>%
pull(age_at_sample)
# Create the plot
p <- ggplot(data = individual_data, aes(x = age_at_sample, y = e_conc_ug)) +
geom_point() +
geom_line(color = "blue") +
geom_vline(xintercept = first_above_threshold_age_by_id,
linetype = "dashed", color = "red") +
geom_vline(xintercept = as.numeric(duckface_ages),
linetype = "solid", color = "purple") + # Add vertical lines for duckface dates
ggtitle(paste0("Estradiol by age at sample for individual ", individual, " (n = ", nrow(individual_data), " samples)")) +
xlab("Age at sample") +
ylab("Fecal estradiol (ug/g)") +
theme(legend.position = "none")
print(p)
}
)
## [1] "Threshold for TER : 1.10006119055108"
## [1] "Age at menarche for TER: 4.7"
## [1] "Threshold for MIL : 0.196851661723125"
## [1] "Age at menarche for MIL: 4.42"
## [1] "Threshold for TAN : 0.47255972819919"
## [1] "Age at menarche for TAN: 4.07"
## [1] "Threshold for MAY : 1.98306642185859"
## [1] "Age at menarche for MAY: 4.04"
## [1] "Threshold for MON : 7.42743013084399"
## [1] "Age at menarche for MON: Inf"
## [1] "Threshold for MIS : 3.03075208724884"
## [1] "Age at menarche for MIS: Inf"
## [1] "Threshold for MUS : 0.699842584095534"
## [1] "Age at menarche for MUS: 4.54"
Considering association if peak in E occurs within 5 days of duckface observation….it doesn’t seem like it does :(. In fact, we don’t have any observations of duckface within 5 days of any potential menarche peak. This could be due to sampling bias! But, in any case, we cannot say that duckface is a signal of menarche with our data.
find_first_rise_date <- function(df, threshold) {
df %>%
arrange(age_at_sample) %>%
filter(e_conc_ug > threshold) %>%
slice(1) %>%
pull(date)
}
# Statistical analysis to check association between duckface and first rise in estradiol
association_results <- data.frame(ind_id = character(), first_rise_date = as.Date(character()), duckface_within_2_days = logical(), duckface_dates = character(), stringsAsFactors = FALSE)
for (individual in females_for_menarche_panel) {
individual_data <- menarche_panel_df %>%
filter(ind_id == individual)
first_rise_date <- find_first_rise_date(individual_data, calculated_treshold)
duckface_dates <- duckface_data %>%
filter(ind_id == individual) %>%
pull(date)
date_differences <- as.numeric(difftime(duckface_dates, first_rise_date, units = "days"))
duckface_within_5_days <- any(abs(date_differences) <= 5)
# Get the duckface dates within 2 days
duckface_dates_within_5_days <- duckface_dates[abs(date_differences) <= 5]
# Join the duckface dates within 2 days into a single string
duckface_dates_str <- paste(duckface_dates_within_5_days, collapse = ", ")
association_results <- rbind(association_results, data.frame(ind_id = individual, first_rise_date = first_rise_date, duckface_within_5_days = duckface_within_5_days, duckface_dates = duckface_dates_str, stringsAsFactors = FALSE))
}
print(association_results)
## ind_id first_rise_date duckface_within_5_days duckface_dates
## 1 TER 2021-03-22 FALSE
## 2 MIL 2021-06-09 FALSE
## 3 TAN 2021-03-26 FALSE
## 4 MAY 2021-04-22 FALSE
## 5 MON 2021-06-03 FALSE
## 6 MIS 2021-06-09 FALSE
## 7 MUS 2023-05-11 FALSE
Pretty simple, just counting the days between local E peaks for each of our adult females.
# Panel of all individual females who have good E coverage before and after menarche
females_for_cycling_panel<-e_data |>
group_by(ind_id) |>
filter(min(age_at_sample)>4) |>
mutate(count = n()) |>
ungroup() |>
select(ind_id,count) |>
arrange(desc(count)) |>
distinct() |>
filter(count>25) |>
pull(ind_id)
females_for_cycling_panel_df<- e_data |>
filter(ind_id %in% females_for_cycling_panel) |>
mutate(date = as.Date(date)) |>
# filter(date > as.Date("2021-12-31")) |>
select(ind_id, age_at_sample, sample_id, date, month, time, hour, e_conc_ug, is_pregnant) |>
arrange(ind_id, age_at_sample)
n_figure_2 <-nrow(females_for_cycling_panel_df)
create_continuous_segments <- function(df) {
df %>%
arrange(date) %>%
mutate(segment = cumsum(is_pregnant == "P") + 1) %>%
group_by(segment) %>%
filter(any(is_pregnant == "NP")) %>%
ungroup() %>%
mutate(segment = cumsum(is_pregnant == "P") + 1) %>%
filter(is_pregnant == "NP")
}
# Function to find local peaks
find_peaks <- function(df) {
df %>%
arrange(date) %>%
mutate(peak = rollapply(e_conc_ug, width = 3, FUN = function(x) x[2] == max(x), fill = NA, align = 'center')) %>%
filter(peak == TRUE) %>%
select(ind_id, date, e_conc_ug, is_pregnant)
}
# Function to calculate cycle lengths
calculate_cycle_lengths <- function(df) {
df <- df |>
filter(is_pregnant == "NP")
peaks <- find_peaks(df)
peaks <- peaks %>%
arrange(date) %>%
mutate(next_date = lead(date)) %>%
mutate(cycle_length = as.numeric(difftime(next_date, date, units = "days"))) %>%
filter(!is.na(cycle_length))
# Filter to only include cycle lengths within 45 days
valid_cycles <- peaks %>%
filter(cycle_length <= 45)
# Calculate average cycle length
avg_cycle_length <- valid_cycles %>%
summarise(avg_cycle_length = mean(cycle_length, na.rm = TRUE))
return(avg_cycle_length)
}
# Initialize a vector to store all cycle lengths
all_cycle_lengths <- c()
# Loop through each individual and create a ggplot for each
for (individual in females_for_cycling_panel) {
individual_data <- females_for_cycling_panel_df %>%
filter(ind_id == individual)
# Calculate cycle lengths for this individual
cycle_lengths <- calculate_cycle_lengths(individual_data)
# Add this individual's cycle lengths to the overall list
all_cycle_lengths <- c(all_cycle_lengths, cycle_lengths)
# Create continuous segments based on NP periods
segmented_data <- create_continuous_segments(individual_data)
p<-ggplot(data = individual_data, aes(x = date, y = e_conc_ug)) +
geom_point(aes(color = is_pregnant)) +
ggtitle(paste0("Estradiol by date for individual ", individual, " (n = ", nrow(individual_data), " samples)")) +
xlab("Date") +
ylab("Fecal estradiol (ug/g)") +
theme(legend.position = "right") +
labs(color = "Pregnancy Status")+
geom_line(data = segmented_data, aes(group = segment), color = "blue") # Plot lines for continuous NP segments
print(p)
# Print cycle lengths for this individual
print(paste0("Average cycle length for individual ", individual, ": ", cycle_lengths$avg_cycle_length, " days"))
}
## [1] "Average cycle length for individual TRU: 16.3636363636364 days"
## [1] "Average cycle length for individual TAM: 23.0909090909091 days"
## [1] "Average cycle length for individual TES: 18.4285714285714 days"
## [1] "Average cycle length for individual MUL: 18.1666666666667 days"
## [1] "Average cycle length for individual MEZ: 23.6 days"
## [1] "Average cycle length for individual TET: 22.3 days"
## [1] "Average cycle length for individual TEL: 21 days"
# Calculate overall average and standard error of cycle lengths
all_cycle_lengths<-as.numeric(all_cycle_lengths)
# Calculate overall average and standard error of cycle lengths
overall_avg_cycle_length <- mean(all_cycle_lengths, na.rm = TRUE)
overall_se_cycle_length <- sd(all_cycle_lengths, na.rm = TRUE) / sqrt(length(all_cycle_lengths))
print(paste0("Overall average cycle length: ", overall_avg_cycle_length, " days"))
## [1] "Overall average cycle length: 20.4213976499691 days"
print(paste0("Overall standard error of cycle lengths: ", overall_se_cycle_length, " days"))
## [1] "Overall standard error of cycle lengths: 1.05350126578618 days"
Purple lines = observed duck face
for (individual in females_for_cycling_panel) {
individual_data <- females_for_cycling_panel_df %>%
filter(ind_id == individual)
# Calculate cycle lengths for this individual
cycle_lengths <- calculate_cycle_lengths(individual_data)
# Create continuous segments based on NP periods
segmented_data <- create_continuous_segments(individual_data)
# Get duckface dates for this individual
duckface_dates <- duckface_data %>%
filter(ind_id == individual) %>%
pull(date)
p <- ggplot(data = individual_data, aes(x = date, y = e_conc_ug)) +
geom_point(aes(color = is_pregnant)) +
ggtitle(paste0("Estradiol by date for individual ", individual, " (n = ", nrow(individual_data), " samples)")) +
xlab("Date") +
ylab("Fecal estradiol (ug/g)") +
theme(legend.position = "none") +
geom_line(data = segmented_data, aes(group = segment), color = "blue") + # Plot lines for continuous NP segments
geom_vline(xintercept = as.numeric(duckface_dates), linetype = "solid", color = "purple") + # Add vertical lines for duckface dates
geom_vline(xintercept = as.Date(cycle_lengths$avg_cycle_length), linetype = "dashed", color = "green")
print(p)
# Print cycle lengths for this individual
print(paste0("Average cycle length for individual ", individual, ": ", cycle_lengths$avg_cycle_length, " days"))
}
## [1] "Average cycle length for individual TRU: 16.3636363636364 days"
## [1] "Average cycle length for individual TAM: 23.0909090909091 days"
## [1] "Average cycle length for individual TES: 18.4285714285714 days"
## [1] "Average cycle length for individual MUL: 18.1666666666667 days"
## [1] "Average cycle length for individual MEZ: 23.6 days"
## [1] "Average cycle length for individual TET: 22.3 days"
## [1] "Average cycle length for individual TEL: 21 days"
# Combine data from all individuals
all_peaks <- data.frame()
all_duckface_events <- data.frame()
for (individual in females_for_cycling_panel) {
individual_data <- females_for_cycling_panel_df %>%
filter(ind_id == individual)
# Find peaks for this individual
peaks <- find_peaks(individual_data)
all_peaks <- rbind(all_peaks, peaks)
# Get duckface dates for this individual
duckface_dates <- duckface_data %>%
filter(ind_id == individual) %>%
pull(date)
duckface_events <- data.frame(ind_id = individual, date = duckface_dates)
all_duckface_events <- rbind(all_duckface_events, duckface_events)
}
# Check for duckface events within ±4 days of peak estradiol dates
all_peaks <- all_peaks %>%
mutate(duckface_within_4_days = sapply(date, function(peak_date) {
any(abs(difftime(all_duckface_events$date[all_duckface_events$ind_id == ind_id], peak_date, units = "days")) <= 4)
}))
# Summarize results
summary_table <- all_peaks %>%
group_by(ind_id) %>%
summarise(
total_peaks = n(),
peaks_with_duckface = sum(duckface_within_4_days),
proportion_with_duckface = mean(duckface_within_4_days)
)
print(summary_table)
## # A tibble: 7 × 4
## ind_id total_peaks peaks_with_duckface proportion_with_duckface
## <chr> <int> <int> <dbl>
## 1 MEZ 22 3 0.136
## 2 MUL 21 5 0.238
## 3 TAM 30 9 0.3
## 4 TEL 16 6 0.375
## 5 TES 29 9 0.310
## 6 TET 22 9 0.409
## 7 TRU 31 10 0.323
We want to formally test if duckface covaries with E such that we can map E onto the ovarian cycle. We would expect duckface to coincide with ovulation, so I am looking for duckface occurrences within 5 days of local E peaks.
NOTE: this is a LOT of code/data wrangling…I did my best to annotate it, but please let me know if I can clarify anything.
e_data$date<-as.Date(e_data$date)
# Calculate cumulative rainfall over the last 90 days
calculate_weather_metrics <- function(combined_df, weather_data) {
# Join the weather data with the combined dataframe
combined_df_with_weather <- combined_df %>%
left_join(weather_data, by = "date") %>%
arrange(ind_id, date)
# Calculate cumulative rainfall for the past 30 days
combined_df_with_weather <- combined_df_with_weather %>%
mutate(
mean_rain_last_90_days = rollapply(rainfall, width = 90, FUN = mean, align = "right", fill = NA, na.rm = TRUE),
mean_max_temp_last_30_days = rollapply(temperature, width = 30, FUN = mean, align = "right", fill = NA, na.rm = TRUE)
) %>%
ungroup()
return(combined_df_with_weather)
}
# finding duckface and E data within 5 days of each other
expanded_duckface_data_5_days <- duckface_data %>%
rowwise() %>%
do(data.frame(ind_id = .$ind_id,
date = seq(.$date - days(3), .$date + days(2), by = "day"),
age_at_sample = .$age_at_sample,
original_duckface_date = .$date
)) %>%
ungroup()
duckface_e_df_5_days <- expanded_duckface_data_5_days %>%
inner_join(e_data, by = c("ind_id", "date")) |>
select(ind_id, sample_id, age_at_sample.x, date, original_duckface_date, e_conc_ug) %>%
rename("sample_date" = date,
"age_at_sample" = age_at_sample.x) %>%
distinct(sample_id, .keep_all = TRUE) %>% # in case there are more than 1 duckfaces observed by sampled date, only keep 1
mutate("duckface_sample" = 1L)
duckface_sampleids_5_days<-duckface_e_df_5_days |>
pull(sample_id)
non_duckface_e_df_5_days <- e_data |>
filter(!sample_id %in% duckface_sampleids_5_days) |>
select(ind_id, sample_id, age_at_sample, date, e_conc_ug) |>
mutate("duckface_sample" = 0L)
combined_df_5_days <- bind_rows(duckface_e_df_5_days, non_duckface_e_df_5_days) |>
mutate(date = coalesce(sample_date, date)) %>%
select(-sample_date)
# calculating rainfall and maxtemp in past 30 days of e sample
combined_df_5_days <- calculate_weather_metrics(combined_df_5_days, weather_data) |>
filter(!is.na(mean_rain_last_90_days)) |>
filter(!is.na(mean_max_temp_last_30_days))
# breaking df into young and old based on age below average E threshold and age above threshold
combined_df_5_days_young<-combined_df_5_days |>
filter(age_at_sample < mean_age_above_threshold)
combined_df_5_days_mature <- combined_df_5_days |>
filter(age_at_sample >= mean_age_above_threshold)
Data is still not normal following log transformation. I also tried excluding outlier points using z-scores but still, not normal. I think we have a large enough sample that we can still run parametric tests on these data, but please let me know if you disagree.
shapiro.test(log(combined_df_5_days_mature$e_conc_ug))
##
## Shapiro-Wilk normality test
##
## data: log(combined_df_5_days_mature$e_conc_ug)
## W = 0.99429, p-value = 0.0004842
Starting with a simple T-test (not adjusting for anything!), let’s see if duckface E samples are significantly higher than non-duckface E samples.
t_test_result_5_days <- t.test(log(e_conc_ug) ~ duckface_sample, data = combined_df_5_days_mature)
print(t_test_result_5_days)
##
## Welch Two Sample t-test
##
## data: log(e_conc_ug) by duckface_sample
## t = -3.0925, df = 105.89, p-value = 0.002538
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## -0.7819459 -0.1709997
## sample estimates:
## mean in group 0 mean in group 1
## 0.1860851 0.6625579
ggplot(data = combined_df_5_days_mature, aes(x = factor(duckface_sample), y = log(e_conc_ug))) +
geom_boxplot()
So yes, a simple T-test shows us that samples within 5 days of duckface observations have significantly greater E than samples not within 5 days of duckface observations. BUT, this modeling approach is quite crude. Let’s control/adjust for other factors that will influence this relationship….
I am using linear mixed-effects model averaging with the
'MuMIn' and 'lme4' packages in R. The
following variables are being loaded into the full model:
| Fixed/Random | Predictor | Description |
|---|---|---|
| Fixed | Age at sample | Age of individual at which the sample was collected |
| Fixed | Duckface sample | A binary variable for whether or not the sample was collected within 5 days of a duckface observation |
| Fixed | Mean rain last 90 days | Average rainfall across the 90 days leading up to sample collection—from MERRA 2. NOTE: this is a scaled, continuous variable |
| Fixed | Mean maximum temperature last 30 days | Average maximum temperature across the 90 days leading up to sample collection—from MERRA 2 NOTE: this is a scaled, continuous variable |
| Random | Individual ID | Female ID |
full_formula_lmer_mature <- lmer(log(e_conc_ug) ~ age_at_sample + duckface_sample +
scale(mean_rain_last_90_days) +scale(mean_max_temp_last_30_days) + (1 | ind_id),
data = combined_df_5_days_mature,
na.action = "na.fail")
summary(full_formula_lmer_mature)
## Linear mixed model fit by REML ['lmerMod']
## Formula:
## log(e_conc_ug) ~ age_at_sample + duckface_sample + scale(mean_rain_last_90_days) +
## scale(mean_max_temp_last_30_days) + (1 | ind_id)
## Data: combined_df_5_days_mature
##
## REML criterion at convergence: 3631.4
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.8783 -0.6464 -0.0328 0.6695 3.2860
##
## Random effects:
## Groups Name Variance Std.Dev.
## ind_id (Intercept) 0.4281 0.6543
## Residual 1.7175 1.3105
## Number of obs: 1054, groups: ind_id, 31
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 0.330900 0.241935 1.368
## age_at_sample 0.004033 0.016399 0.246
## duckface_sample 0.394228 0.147981 2.664
## scale(mean_rain_last_90_days) 0.051769 0.061894 0.836
## scale(mean_max_temp_last_30_days) -0.068751 0.047051 -1.461
##
## Correlation of Fixed Effects:
## (Intr) ag_t_s dckfc_ s(___9
## age_at_smpl -0.823
## duckfc_smpl -0.061 0.027
## scl(___90_) -0.012 0.080 0.004
## sc(____30_) -0.114 0.076 -0.076 -0.219
plot_model(full_formula_lmer_mature, type = "est", show.values = TRUE, value.offset = .3, title = "Log(Estradiol)", show.intercept = TRUE)
NOTE; when model averaging with a delta AICc parameter of 2, there is actually only one model that minimizes AIC, and it only includes duckface as a predictor.
So, our best model is actually just
log(e_conc_ug) ~ intercept + duckface_sample + (1|ind_id)
dredged_model_lmer_mature <- dredge(full_formula_lmer_mature)
## Warning in dredge(full_formula_lmer_mature): comparing models fitted by REML
## Fixed term is "(Intercept)"
print(dredged_model_lmer_mature)
## Global model call: lmer(formula = log(e_conc_ug) ~ age_at_sample + duckface_sample +
## scale(mean_rain_last_90_days) + scale(mean_max_temp_last_30_days) +
## (1 | ind_id), data = combined_df_5_days_mature, na.action = "na.fail")
## ---
## Model selection table
## (Int) age_at_smp dck_smp scl(men_max_tmp_lst_30_dys)
## 3 0.3468 0.3803
## 1 0.3746
## 7 0.3617 0.3944 -0.05827
## 11 0.3524 0.3788
## 5 0.3877 -0.04888
## 9 0.3808
## 4 0.3151 0.0028450 0.3803
## 15 0.3756 0.3938 -0.06776
## 2 0.3554 0.0017680
## 13 0.4014 -0.05833
## 8 0.3419 0.0018470 0.3943 -0.05880
## 12 0.3049 0.0042610 0.3788
## 6 0.3801 0.0008133 -0.04951
## 10 0.3445 0.0032780
## 16 0.3309 0.0040330 0.3942 -0.06875
## 14 0.3695 0.0029220 -0.05928
## scl(men_ran_lst_90_dys) df logLik AICc delta weight
## 3 4 -1809.545 3627.1 0.00 0.672
## 1 3 -1811.858 3629.7 2.61 0.182
## 7 5 -1810.915 3631.9 4.76 0.062
## 11 0.02349 5 -1811.375 3632.8 5.68 0.039
## 5 4 -1813.463 3635.0 7.84 0.013
## 9 0.02723 4 -1813.659 3635.4 8.23 0.011
## 4 5 -1812.796 3635.6 8.52 0.009
## 15 0.04725 6 -1812.515 3637.1 9.98 0.005
## 2 4 -1815.106 3638.3 11.12 0.003
## 13 0.04762 5 -1815.056 3640.2 13.04 0.001
## 8 6 -1814.148 3640.4 13.25 0.001
## 12 0.02790 6 -1814.590 3641.3 14.13 0.001
## 6 5 -1816.694 3643.4 16.32 0.000
## 10 0.03088 5 -1816.875 3643.8 16.68 0.000
## 16 0.05177 7 -1815.695 3645.5 18.37 0.000
## 14 0.05137 6 -1818.243 3648.6 21.44 0.000
## Models ranked by AICc(x)
## Random terms (all models):
## 1 | ind_id
# averaged_model_lmer_mature <- model.avg(dredged_model_lmer_mature, subset = delta < 2)
best_model_duckface_lmer <- lmer(log(e_conc_ug) ~ duckface_sample +
(1 | ind_id),
data = combined_df_5_days_mature,
na.action = "na.fail")
summary(best_model_duckface_lmer)
## Linear mixed model fit by REML ['lmerMod']
## Formula: log(e_conc_ug) ~ duckface_sample + (1 | ind_id)
## Data: combined_df_5_days_mature
##
## REML criterion at convergence: 3619.1
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.8625 -0.6450 -0.0410 0.6742 3.2789
##
## Random effects:
## Groups Name Variance Std.Dev.
## ind_id (Intercept) 0.341 0.584
## Residual 1.724 1.313
## Number of obs: 1054, groups: ind_id, 31
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 0.3468 0.1244 2.787
## duckface_sample 0.3803 0.1477 2.576
##
## Correlation of Fixed Effects:
## (Intr)
## duckfc_smpl -0.083
plot_model(best_model_duckface_lmer, type = "est", show.values = TRUE, value.offset = .3, title = "Log(Estradiol)", show.intercept = TRUE)
I would appreciate someone else’s opinion, but I am inclined to say that our model passes diagnostics. I am a bit concerned about the heteroskedasticity though, and we already know that our e_data isn’t normal….
# checking model diagnostics
plot(best_model_duckface_lmer) #resids vs leverage -- pretty good, maybe some heteroskedasticity
lattice::qqmath(best_model_duckface_lmer) ## qq norm plot -- pretty good...some tails
plot(best_model_duckface_lmer, # scale location plot
sqrt(abs(resid(.)))~fitted(.),
type=c("p","smooth"), col.line=1)
first_birth_data <- life_history_data |>
filter(Event %in% c("DOB", "First Birth")) |>
select(ID, Event, Date) |>
pivot_wider(names_from = Event, values_from = Date) |>
filter(!is.na(`First Birth`)) |> # Keep only individuals with a First Birth value
mutate(Age_at_First_Birth = (as.numeric(`First Birth`) - as.numeric(DOB))/365)
# Calculate the average age at first birth and its standard error
age_stats <- first_birth_data |>
summarize(
Average_Age = mean(Age_at_First_Birth, na.rm = TRUE),
Std_Dev_Age = sd(Age_at_First_Birth, na.rm = TRUE),
Sample_Size = n()
) |>
mutate(
Std_Error_Age = Std_Dev_Age / sqrt(Sample_Size)
)
# Display the results
kable(age_stats)
| Average_Age | Std_Dev_Age | Sample_Size | Std_Error_Age |
|---|---|---|---|
| 6.190776 | 1.021683 | 15 | 0.2637973 |
I am not sure how to go about doing this…E seems to bounce around a LOT throughout gestation. The green dashed line on each plot represents the female’s average non-pregnant E.
e_data <- e_data |>
mutate(across(c(B2,B3,B4), ~ na_if(.x, " ")))
e_data$B1 <- as.Date(e_data$B1)
e_data$B2 <- as.Date(e_data$B2)
e_data$B3 <- as.Date(e_data$B3)
e_data$B4 <- as.Date(e_data$B4)
e_data$date <- as.Date(e_data$date)
# # Calculate the difference in months
#
gestation_df_long <- e_data %>%
pivot_longer(cols = c(B1, B2, B3, B4), names_to = "birth_number", values_to = "birthing_date")
gestation_df <- gestation_df_long |>
mutate(months_from_birth = interval(date, birthing_date) / months(1)) |>
select(ind_id, age_at_sample, month, months_from_birth, e_conc_ug, birth_number)
gestating_df<- gestation_df |>
filter(months_from_birth >= 0 & months_from_birth <= 9) |>
mutate(months_from_birth = -1*months_from_birth)
gestating_df_females<-gestating_df |>
distinct(ind_id) |>
pull(ind_id)
not_gestating_df<-gestation_df |>
filter(is.na(months_from_birth) | months_from_birth > 9) |>
filter(age_at_sample > 6) |>
group_by(ind_id) |>
summarise(non_gestating_e2 = mean(e_conc_ug))
# Plot the data for each individual
for (individual in gestating_df_females) {
individual_data <- gestating_df %>%
filter(ind_id == individual)
# Establish baseline E2 for the individual
baseline_e <- not_gestating_df |>
filter(ind_id == individual) |>
pull(non_gestating_e2)
p <- ggplot(data = individual_data, aes(x = months_from_birth, y = e_conc_ug, color = birth_number, linetype = birth_number)) +
geom_point() +
geom_line() +
geom_hline(yintercept = baseline_e, linetype = "dashed", color = "green") +
ggtitle(paste0("Estradiol and gestation for ", individual, " (n = ", nrow(individual_data), " samples)")) +
xlab("Months from birth") +
ylab("Fecal estradiol (ug/g)") +
ylim(0, 150) +
xlim(-9, 0) +
theme(legend.position = "right") +
labs(color = "Birth Number", linetype = "Birth Number")
print(p)
}
## `geom_line()`: Each group consists of only one observation.
## ℹ Do you need to adjust the group aesthetic?
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).
This is just using the trimester estimates that were already in the dataset because, as I say above, I do not know a good way to use our E values to assign gestation lengths. It bounces around SO much!
Restricting dataset to adults
(age_at_sample>mean_age_above_threshold)
pregnant_df<-e_data |>
select(ind_id, month, hour, age_at_sample, e_conc_ug, preg_trim, pregnancy_num, is_pregnant) |>
filter(age_at_sample > mean_age_above_threshold) |>
mutate(preg_trim = factor(preg_trim,
levels = c("", "T1", "T2", "T3"),
labels = c("Not pregnant", "T1", "T2", "T3")))
n_pregnant_df <- nrow(pregnant_df)
Providing both unchanged and log-transformed visualizations.
par(mfrow = c(1,1))
# non transformed
ggplot(data = pregnant_df, aes(x = preg_trim, y = e_conc_ug))+
geom_boxplot()+
geom_jitter(alpha = 0.1)+
ggtitle(paste("Estradiol and gestation (N = ", n_pregnant_df, "samples)")) +
xlab("Pregnancy trimester")+
ylab("Fecal estradiol (ug/g)")+
scale_x_discrete(labels = c("Not pregnant", "T1", "T2", "T3"))
# log transformed
ggplot(data = pregnant_df, aes(x = preg_trim, y = log(e_conc_ug)))+
geom_boxplot()+
geom_jitter(alpha = 0.1)+
ggtitle(paste("Estradiol and gestation (N = ", n_pregnant_df, "samples)")) +
xlab("Pregnancy trimester")+
ylab("Log(Fecal estradiol [ug/g])")+
scale_x_discrete(labels = c("Not pregnant", "T1", "T2", "T3"))
Data is still not normal following log transformation
shapiro.test(pregnant_df$e_conc_ug)
##
## Shapiro-Wilk normality test
##
## data: pregnant_df$e_conc_ug
## W = 0.27018, p-value < 2.2e-16
shapiro.test(log(pregnant_df$e_conc_ug))
##
## Shapiro-Wilk normality test
##
## data: log(pregnant_df$e_conc_ug)
## W = 0.99424, p-value = 0.0003329
Doing non-parametric tests (Kruskall-Wallis, pairwise Mann Whitney U Tests) to see how E varies from non-pregnant individuals to pregnant individuals across the three trimesters.
We can see pregnancy status is significantly associated with E according to the Kruskall-Wallis test (chi-squared = 196.85, df = 3, p<0.001).
Pregnant E is significantly greater than non-pregnant E for all three trimesters (all p<0.001). Among trimesters of pregnancy, we can see that T2<T3 (p<0.01), but no other significant results.
kruskal_result <- kruskal.test(e_conc_ug ~ preg_trim, data = pregnant_df)
print(kruskal_result)
##
## Kruskal-Wallis rank sum test
##
## data: e_conc_ug by preg_trim
## Kruskal-Wallis chi-squared = 196.85, df = 3, p-value < 2.2e-16
# Pairwise comparisons (if the Kruskal-Wallis test is significant)
pairwise_results <- pairwise.wilcox.test(pregnant_df$e_conc_ug,
pregnant_df$preg_trim,
p.adjust.method = "bonferroni")
print(pairwise_results)
##
## Pairwise comparisons using Wilcoxon rank sum test with continuity correction
##
## data: pregnant_df$e_conc_ug and pregnant_df$preg_trim
##
## Not pregnant T1 T2
## T1 1.1e-13 - -
## T2 3.0e-11 0.6410 -
## T3 < 2e-16 0.8521 0.0045
##
## P value adjustment method: bonferroni
Restricting dataset to adults (age_at_sample >
mean_age_above_threshold)
lactation_df <- e_data |>
select(ind_id, month, hour, age_at_sample, e_conc_ug, lactating) |>
filter(age_at_sample > mean_age_above_threshold)
n_lactation_df<-nrow(lactation_df)
Providing both unchanged and log-transformed visualizations.
# non transformed
ggplot(data = lactation_df, aes(x = lactating, y = e_conc_ug))+
geom_boxplot()+
geom_jitter(alpha = 0.1)+
ggtitle(paste("Estradiol and lactation (N = ", n_pregnant_df, "samples)")) +
xlab("Lactating")+
ylab("Fecal estradiol (ug/g)")
# log transformed
ggplot(data = lactation_df, aes(x = lactating, y = log(e_conc_ug)))+
geom_boxplot()+
geom_jitter(alpha = 0.1)+
ggtitle(paste("Estradiol and lactation (N = ", n_pregnant_df, "samples)")) +
xlab("Lactating")+
ylab("log(Fecal estradiol [ug/g])")
Data is still not normal following log transformation
shapiro.test(lactation_df$e_conc_ug)
##
## Shapiro-Wilk normality test
##
## data: lactation_df$e_conc_ug
## W = 0.27018, p-value < 2.2e-16
shapiro.test(log(lactation_df$e_conc_ug))
##
## Shapiro-Wilk normality test
##
## data: log(lactation_df$e_conc_ug)
## W = 0.99424, p-value = 0.0003329
Data is not normal, so can’t do ANOVA. Let’s use a nonparametric approach.
We can see that E is significantly lower during lactation (W = 138521, p < 0.001).
wilcox.test(log(e_conc_ug) ~ lactating, data = lactation_df)
##
## Wilcoxon rank sum test with continuity correction
##
## data: log(e_conc_ug) by lactating
## W = 138521, p-value = 4.561e-10
## alternative hypothesis: true location shift is not equal to 0
Let’s do a box plot looking at E across the lifecourse:
age_at_sample<
mean_age_above_threshold)is_pregnant = “N”)is_pregnant = “Y”)lactating = “YES”)I prefer the log-transformed plots, but I’ll provide both transformed and not-transformed. Let me know what you prefer!
# Prepare the data
e_data_with_lifestage <- e_data |>
mutate(
life_stage = case_when(
age_at_sample < mean_age_above_threshold ~ "Juvenility",
lactating == "YES" & age_at_sample > mean_age_above_threshold ~ "Lactating",
is_pregnant == "NP" & age_at_sample > mean_age_above_threshold & lactating != "YES" ~ "Cycling",
is_pregnant == "P" ~ "Pregnant",
TRUE ~ NA_character_ # Handle any other cases if necessary
)
)
# Create the box plot with non-transformed e
ggplot(data = e_data_with_lifestage, aes(x = life_stage, y = e_conc_ug)) +
geom_boxplot() +
geom_jitter(alpha = 0.1) +
ggtitle("Estradiol Concentration Across the Lifecourse") +
xlab("Life Stage") +
ylab("Fecal Estradiol (ug/g)") +
scale_x_discrete(limits = c("Juvenility", "Cycling", "Pregnant", "Lactating"))
# Create the box plot with log-transformed e
ggplot(data = e_data_with_lifestage, aes(x = life_stage, y = log(e_conc_ug))) +
geom_boxplot() +
geom_jitter(alpha = 0.1) +
ggtitle("Estradiol Concentration Across the Lifecourse") +
xlab("Life Stage") +
ylab("Log(Fecal Estradiol [ug/g])") +
scale_x_discrete(limits = c("Juvenility", "Cycling", "Pregnant", "Lactating"))